Session Summary: How BMW Group uses AWS serverless analytics for a data-driven ecosystem #ANT310 #reinvent
This post is a session report on ANT310: How BMW Group uses AWS serverless analytics for a data-driven ecosystem at AWS re:Invent 2020.
The Japanese version of this post is available here.
Abstract
Data is the lifeblood fueling BMW Group’s digital transformation. It drives BMW’s personalized customer experiences, connected mobility solutions, and analytical insights. This session walks through the journey of building BMW Group’s Cloud Data Hub. BMW Group’s technical lead, Simon Kern, dives deep into how the company is leveraging AWS serverless capabilities for delivering ETL functions on big data in a modularized, accessible, and repeatable fashion and provides insight into the next steps of the journey. The services used in BMW Group’s AWS architecture are discussed, including AWS Glue, Amazon Athena, Amazon SageMaker, and more.
Speakers
- Simon Kern
- Lead DevOps Engineer - BMW Group
How BMW Group uses AWS serverless analytics for a data-driven ecosystem - AWS re:Invent 2020
Content
- BMW Group IT: Brief intro
- Cloud Data Hub: BMW Group’s central data lake
- Orchestrating data
- Ingesting and preparing data
- Analyzing data
- Outlook
BMW Group IT: Brief intro
BMW Group is a global mobility company operating in 29 countries, with employees of 60 nationalities. It can also be called an IT company: 694 locations are connected through its global IT network, and it delivers over 230 software products. One of the most important services is BMW’s ConnectedDrive backend, to which over 14 million vehicles are connected and which serves over 1 billion requests per day.
- -> BMW Group produces a lot of data with these backend systems.
- -> Ingest the data into a data lake to organize and analyze it together
- -> Cloud Data Hub
Cloud Data Hub
- Cloud native data lake that makes it easy to...
- Ingest data
- Have a scalable storage solution
- Opens up many possibilities to get value out of the data
- Left Side
- Over 500 software and data engineers
- Build data ingests and data preparations to fuel a data marketplace
- Right Side
- Over 5,000 business analysts and data scientists
- Build use cases, machine learning models and AI products
- -> Data democratization: easy to access all the data in the BMW Group
You can work with the data seamlessly from the portal below.
The internal architecture consists of three pillars: the data providers, the data consumers, and the data portal and APIs. All of them are built on a multi-tiered AWS account setup.
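As a rough illustration of that multi-account setup (not BMW’s actual code), a consumer account typically assumes a role in a hub account before touching shared data. The role ARN and bucket name below are hypothetical.

```python
import boto3

# Hypothetical role and bucket names, for illustration only.
HUB_READ_ROLE_ARN = "arn:aws:iam::111111111111:role/cdh-hub-read-access"
HUB_BUCKET = "cdh-hub-example-datasets"

sts = boto3.client("sts")

# Assume a read role in the hub account from a consumer account.
creds = sts.assume_role(
    RoleArn=HUB_READ_ROLE_ARN,
    RoleSessionName="use-case-analytics",
)["Credentials"]

# Use the temporary credentials to list objects in the hub's dataset bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
response = s3.list_objects_v2(Bucket=HUB_BUCKET, Prefix="prepared/", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"])
```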
Orchestrating data
- Data Providers (Left Side)
- Global IT units that provide central datasets
- Local IT units
- Use Cases (Right Side)
- Controlled by the access management layer, which determines which datasets they are allowed to see
- Global use cases use global datasets
- US use cases use global and local datasets
- Data portal and API (Center)
- Important for customers to...
- Explore and query the data
- Manage metadata
- Deploy infrastructure
- Built on top of several APIs
- Security, central compliance services
- Single sign-on for all users
- Gray boxes
- Separate markets and legal entities into different hubs with their own storage accounts and processes.
- Unified seamless front end
- Dataset
- A combination of S3 buckets and Glue databases (see the sketch after this list)
- Always lives inside the Hub
- Assigned to a business object which categorizes the data into a separate unit
- 3 types of layers, and every dataset belongs to one of them
- Source: a copy of the source system
- Prepared: cleaned and harmonized data
- Semantic: data enriched by aggregations or joins
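To make the dataset concept more concrete, here is a minimal, hypothetical sketch of the catalog side of a dataset: a Glue database whose location points at an S3 prefix. The naming scheme (hub, business object, layer) is only an assumption based on the description above, not BMW’s actual convention.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical naming for a prepared-layer dataset of a "sales orders" business object.
database_name = "cdh_sales_orders_prepared"
s3_location = "s3://cdh-hub-example-datasets/sales-orders/prepared/"

# A dataset is roughly "S3 bucket/prefix + Glue database": create the catalog side here.
glue.create_database(
    DatabaseInput={
        "Name": database_name,
        "Description": "Prepared-layer dataset for the sales orders business object",
        "LocationUri": s3_location,
    }
)
```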
Ingesting and preparing data
- Key concepts of data ingestion
- Ease of use
- Ingestion kickstart via a UI-accessible, TypeScript-based CDK stack
- Advanced features can be leveraged via Terraform modules
- Flexibility
- Specialized building blocks
- Reusability via modularization
- Maintainability
- Community via internal open source
- Bigger changes via forks
- Ease of use
- Ingestion from on-premises systems into the CDH Core account (see the sketch after this list)
- All setup via Terraform
- Glue ETL
- Running in a private VPC
- Reads and pulls data from the on-premises network
- Store it in the central S3 bucket of Cloud Data Hub
- Secrets Manager
- Store database credentials
- CloudWatch
- Logging and triggering Glue jobs
- Glue Data Catalog
- Lambda syncs catalogs in Provider and CDH Core
- Independent security account
- Store KMS keys
- PII API
- Encrypt sensitive data
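A heavily simplified sketch of what such an ingest job could look like as a Glue PySpark script is shown below: database credentials come from Secrets Manager, the table is pulled over JDBC from the on-premises network, and the raw copy is written into the central S3 bucket. The secret name, JDBC details, and S3 path are made up for illustration.

```python
import json
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Fetch on-premises database credentials from Secrets Manager (hypothetical secret name).
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="cdh/ingest/source-db")["SecretString"]
)

# Pull a table from the on-premises database over JDBC (the job runs inside a private VPC,
# and the matching JDBC driver must be available to the Glue job).
source_df = (
    spark.read.format("jdbc")
    .option("url", secret["jdbc_url"])
    .option("dbtable", "sales.orders")
    .option("user", secret["username"])
    .option("password", secret["password"])
    .load()
)

# Store the raw copy in the central Cloud Data Hub bucket (source layer, hypothetical path).
source_df.write.mode("append").parquet(
    "s3://cdh-core-central-bucket/sales-orders/source/"
)
```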
- Data preparation
- Set up via another Terraform module
- Read data from the central S3 Bucket (source layer)
- Write it into the prepared-layer datasets (see the sketch below)
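Along the same lines, here is a minimal plain-PySpark sketch of a preparation step: read the source-layer copy, clean and harmonize it, and write it into the prepared-layer dataset. Column names and paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw copy from the source layer (hypothetical path).
source_df = spark.read.parquet("s3://cdh-core-central-bucket/sales-orders/source/")

# Clean and harmonize: drop duplicates, normalize column names and types.
prepared_df = (
    source_df.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .withColumnRenamed("cust_id", "customer_id")
)

# Write into the prepared-layer dataset.
prepared_df.write.mode("overwrite").parquet(
    "s3://cdh-hub-example-datasets/sales-orders/prepared/"
)
```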
- Ingestion challenges
- S3 object ownership problem caused by multi accounts
- The object ownership doesn't match the bucket ownership
- Bucket policies don't apply, and the IAM role setup for Glue jobs gets cumbersome
- A recent S3 update means the role switching is no longer necessary
- Job sizing
- Choose the right number of DPUs
- Hard to automate, so it is based on best practices
- Lightweight ETL via Spark orchestrated on AWS Fargate
- Small files
- Built a compaction module running on Glue and Athena (see the sketch after this list)
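The core of such a compaction module could look like the following PySpark sketch, which rewrites a partition full of small files into a few larger ones under a staging prefix. The path and target file count are hypothetical, and this is not BMW’s actual module.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a partition that has accumulated many small files (hypothetical path).
partition_path = "s3://cdh-core-central-bucket/sales-orders/source/ingest_date=2020-12-01/"
df = spark.read.parquet(partition_path)

# Compact into a small number of larger files under a staging prefix,
# which can then replace the original partition.
compacted_path = partition_path.rstrip("/") + "_compacted/"
df.repartition(8).write.mode("overwrite").parquet(compacted_path)
```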
- Ingestion recap
- Reusable building blocks for common tasks
- Multi-account setup
- Infrastructure isolation
- Scale-out to the whole organization
- Team empowerment
- >150 systems ingested
- >1 PB data volume total
- ~ 100 TB data volume via PySpark-based ETL
- The rest of the data volume via stream-based ingests
Analyzing data
- Analyses via
- Amazon Athena (a minimal query sketch is shown below)
- Amazon SageMaker (optionally with Amazon EMR or AWS Glue ETL development endpoints)
- Amazon QuickSight
- Challenge: Moving from exploration to production for non-experts
- Easy integration into CI/CD for ingestion and transformation
- Managed environment for creating new versions of data products
If you'd like to see the analysis demo, please check out the actual session.
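As a taste of the Athena-based exploration, a minimal boto3 sketch for running a query against a Cloud Data Hub dataset might look like this (database, table, and result bucket are hypothetical):

```python
import time
import boto3

athena = boto3.client("athena")

# Start an exploratory query against a prepared-layer dataset (hypothetical names).
execution = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id LIMIT 10",
    QueryExecutionContext={"Database": "cdh_sales_orders_prepared"},
    ResultConfiguration={"OutputLocation": "s3://cdh-athena-query-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```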
- Architecture
- The central data portal and Bitbucket roll out the different CodePipeline pipelines
- Exploration: explore the data and build transformation code
- Development: verify the transformation with Athena
- Production: deploy after verification
- Data analysis recap
- Enable non-experts via managed toolstack
- Built-in best practices
- Predefined path into production
- Empower experts to utilize the full power of AWS services
Outlook
- Integration of AWS Lake Formation for fine-grained access control
- Fine-grained data lineage via Spark execution plans (see the sketch below)
- Automated data monitoring (including frequent users, dataset updates, health, and statistics)
- Query acceleration layer for better performance with established BI tools
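On the lineage item: PySpark exposes a job’s execution plans, which contain the input paths and operations that lineage could be derived from. The snippet below only shows how to print those plans; whether BMW extracts lineage this way is not covered in the session.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical transformation whose lineage we want to inspect.
df = spark.read.parquet("s3://cdh-core-central-bucket/sales-orders/source/")
aggregated = df.groupBy("customer_id").count()

# Print the parsed, analyzed, optimized, and physical plans; these reference the
# input paths and operations, which is the raw material for lineage extraction.
aggregated.explain(True)
```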
AWS re:Invent 2020 is being held now!
Want to watch the whole session? Jump in and sign up for AWS re:Invent 2020!